Conversation

ddh0 (Contributor) commented Oct 15, 2025

Add support for the zai-org/GLM-4.5V vision model to llama.cpp. I currently plan to support only images + text; no video inputs in this PR.

The architecture is Glm4vMoeForConditionalGeneration ("model_type": "glm4v_moe"). Internally, this consists of an LLM (text model) and a ViT (vision adapter / multimodal projector):

LLM (text model glm4v_moe_text)

  • Based on GLM-4.5-Air
  • Tensor names start with model.language_model.
  • Uses a "multimodal 3D RoPE": in apply_multimodal_rotary_pos_emb, rotary embeddings are applied across the temporal, height, and width dimensions for visual tokens (see the sketch after this list)
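
A minimal sketch of how such a 3D ("mrope"-style) rotary application works, modeled on the analogous Qwen2-VL helper in transformers; the section sizes and tensor shapes below are illustrative assumptions, not values verified against the GLM-4.5V implementation:

    import torch

    def rotate_half(x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat((-x2, x1), dim=-1)

    def apply_multimodal_rope_sketch(q, k, cos, sin, mrope_section):
        # q, k:     (batch, n_heads, seq, head_dim)
        # cos, sin: (3, batch, seq, head_dim), one slice per axis: temporal, height, width
        # mrope_section: per-axis split of the rotary dims, e.g. [16, 24, 24]
        sections = mrope_section * 2  # cos/sin repeat the freqs across both halves of head_dim
        cos = torch.cat([m[i % 3] for i, m in enumerate(cos.split(sections, dim=-1))], dim=-1).unsqueeze(1)
        sin = torch.cat([m[i % 3] for i, m in enumerate(sin.split(sections, dim=-1))], dim=-1).unsqueeze(1)
        q_embed = (q * cos) + (rotate_half(q) * sin)
        k_embed = (k * cos) + (rotate_half(k) * sin)
        return q_embed, k_embed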

ViT (vision adapter glm4v_moe)

  • Adapted from apple/aimv2-huge-patch14-336:
    • Architecture Aimv2VisionModel
    • ~681M params
    • 24 layers
    • hidden_size (n_embd): 1536
    • intermediate_size (n_ff): 4096
    • image_size: 336
    • patch_size: 14
    • num_channels: 3
    • depth: 24
  • Tensor names start with model.visual.
  • Its 2D positional embeddings are dynamically adapted via bicubic interpolation within the Glm4vMoeVisionEmbeddings module to handle varied image resolutions (a sketch of this step follows the list)
  • It also applies its own rotary position embeddings within the self-attention blocks (via apply_rotary_pos_emb_vision)
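
A minimal sketch of that bicubic interpolation step; the function name and shapes are assumptions for illustration, not the actual Glm4vMoeVisionEmbeddings code:

    import torch
    import torch.nn.functional as F

    def interpolate_pos_embed(pos_embed: torch.Tensor, h_patches: int, w_patches: int) -> torch.Tensor:
        # pos_embed: (1, N, C) learned at the native grid, here 336 / 14 = 24, so N = 576
        n, c = pos_embed.shape[1], pos_embed.shape[2]
        side = int(n ** 0.5)
        grid = pos_embed.reshape(1, side, side, c).permute(0, 3, 1, 2)        # (1, C, 24, 24)
        grid = F.interpolate(grid, size=(h_patches, w_patches),
                             mode="bicubic", align_corners=False)
        return grid.permute(0, 2, 3, 1).reshape(1, h_patches * w_patches, c)  # (1, H*W, C)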

Other notes:

  • Native context length is 65,536 (as opposed to 131,072 for GLM-4.5-Air)
  • RoPE theta (θ): 10,000.0 (as opposed to 100,000.0 for GLM-4.5-Air); a conversion-time sketch for these two values follows these notes
  • The model itself supports video input, but this PR does not (images only)
  • The tokenizer has video-related special tokens; these need to be handled during conversion
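
For the text model, these two values would end up in the GGUF metadata at conversion time, roughly along these lines (a hedged sketch; the exact hparam keys and fallbacks are assumptions, not final code):

    # inside the GLM-4.5V text-model class in convert_hf_to_gguf.py (sketch only)
    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        self.gguf_writer.add_context_length(self.hparams.get("max_position_embeddings", 65536))
        self.gguf_writer.add_rope_freq_base(self.hparams.get("rope_theta", 10000.0))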

References:

See also:

ddh0 (Contributor, Author) commented Oct 17, 2025

So, it turns out that vision in this model is based on Qwen3-VL, which itself is not yet supported in llama.cpp. I am pretty familiar with llama.cpp in general but not with mtmd, so I may not be able to get this PR done on my own. I will keep trying to hack at it when I have time, and I would appreciate any help I can get. :)

Also just saw this thread (#16207) in which someone has posted a patch to get Qwen3-VL kinda-sorta-working in llama.cpp. I will take a look at that too and see if it is helpful - it might make more sense to get Qwen3-VL to a working state in llama.cpp first and only then start working on this PR on top of that. Not sure, just thinking out loud.

rujialiu commented

Thanks for your work, @ddh0!
Based on the commit history, the qwen3vl imports are the result of a "use the Qwen data classes to avoid repeating them" refactor, so it's probably not quite "based on Qwen3-VL". But anyway, I'm planning to dive into Qwen3-VL and GLM-4.5V later this month, and I hope I can help.

e1732a364fed commented Oct 18, 2025

Here's my silly implementation for the unfinished mmproj part (class GLM4VMoEVisionModel(MmprojModel):):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert self.has_vision_encoder
        assert self.hparams_vision is not None

        # the vision config uses "num_heads" / "depth" rather than the usual key names
        self.hparams_vision["num_attention_heads"] = self.hparams_vision.get("num_heads")
        self.hparams_vision["num_hidden_layers"] = self.hparams_vision.get("depth")

    def set_gguf_parameters(self):
        # keep ddh0's existing set_gguf_parameters() code here as-is
        ...

    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
        del bid  # unused
        if not name.startswith("model.visual."):
            return []  # only vision tensors go into the mmproj file

        name = name.replace("model.visual.", "visual.", 1)

        if ".qkv." in name:
            # split the fused qkv projection into separate q / k / v tensors
            if data_torch.ndim == 2:  # weight: (3 * c, n_embd)
                c3, _ = data_torch.shape
            else:  # bias: (3 * c,)
                c3 = data_torch.shape[0]
            assert c3 % 3 == 0
            c = c3 // 3
            wq = data_torch[:c]
            wk = data_torch[c:c * 2]
            wv = data_torch[c * 2:]
            return [
                (self.map_tensor_name(name.replace("qkv", "q")), wq),
                (self.map_tensor_name(name.replace("qkv", "k")), wk),
                (self.map_tensor_name(name.replace("qkv", "v")), wv),
            ]

        if name.startswith("visual.downsample."):
            # placeholder: mapped to V_POST_NORM only because that is what other code did;
            # the downsample conv almost certainly needs its own tensor type
            suffix = name.split(".", 2)[2]  # "weight" or "bias"
            new_name = self.format_tensor_name(gguf.MODEL_TENSOR.V_POST_NORM, suffix=f".{suffix}")
            return [(new_name, data_torch)]

        return [(self.map_tensor_name(name), data_torch)]

Then edit gguf-py/gguf/tensor_mapping.py (TensorNameMap) to add the following source names (a sketch of what such an entry looks like follows the list):

visual.embeddings.position_embedding
visual.merger.proj.weight
visual.merger.up_proj.weight
visual.merger.gate_proj.weight
visual.merger.down_proj.weight
visual.merger.post_projection_norm
visual.post_conv_layernorm.weight
visual.post_layernorm.weight
visual.merger
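
Each of these becomes an extra source-name entry in the tuple for whichever MODEL_TENSOR it should map to. A hedged sketch of the shape of such an edit (the target member and the pre-existing entry shown are illustrative; they need to be matched to the real tensor types):

    # gguf-py/gguf/tensor_mapping.py, inside TensorNameMap's mapping tables (sketch only)
    MODEL_TENSOR.V_POST_NORM: (
        "vision_tower.vision_model.post_layernorm",  # existing entry, kept
        "visual.post_layernorm",                     # GLM-4.5V (new)
    ),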

If you put these in the right place, running

python convert_hf_to_gguf.py /path/to/GLM-4.5V --outfile /path/to/GLM-4.5V-mmproj.gguf --mmproj

will succeed.

But this is as far as I can help. My lack of knowledge of the model prevents me from digging further, and the converted mmproj won't work until we do it right.

And I believe my treatment of visual.downsample is completely wrong... it's just copied from other code... I don't know what I'm doing...

Maybe refer to
https://huggingface.co/yairpatch/Qwen3-VL-30B-A3B-Thinking-GGUF/blob/main/qwen3vl-implementation.patch
